
Analysing Urban Networks

Introduction

Networks are ubiquitous in urban systems. The more obvious ones are the street, rail and utility networks that undergird the city. However, as planners we also focus on less obvious networks: social networks of people that allow for community formation, the transactional activity of firms that allows us to identify clusters of firms or industries, the mismatch between job seekers and job locations that informs planning for social and physical infrastructure, and the flow of information through organisational linkages during disaster management. These are but a few examples of network analysis in the urban domain. The problem definition frames the analytical tools used in the process.

Acquire Data

In this post, I am going to explore multiple datasets in different problem domains to demonstrate the importance and applicability of network analysis in cities. The datasets are linked in the relevant sections when they are not automatically downloadable.

Key nodes in a network

Let’s begin with the bike share data we are already familiar with. We have already done some network analysis with it (inadvertently); let’s formalise it.

Networks are representations of connections between nodes. As such, each bike station can be treated as a vertex/node and each trip between any two stations as a link. In another representation, you can treat all stations of the same company (i.e. stations between which you can check out and return bikes) as connected to one another, with the number of trips as the weight on each edge. Some edges will then have zero weight. In some instances, you may disallow zero-weight edges, i.e. remove the connection between two stations if there are no trips between them (or if the number of trips is below a threshold).
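As a minimal sketch of this representation (with hypothetical station ids, not the Citi Bike data):

```r
library(igraph)

# Hypothetical trips as an origin-destination edge list
trips <- data.frame(from = c("A", "A", "A", "B"),
                    to   = c("B", "B", "C", "A"))

g <- graph_from_data_frame(trips, directed = TRUE)  # one edge per trip (a multigraph)
vcount(g)  # 3 stations
ecount(g)  # 4 trips

# Collapse repeated trips into edge weights on a simple graph
E(g)$weight <- 1
g_simple <- simplify(g, edge.attr.comb = "sum")
sum(E(g_simple)$weight)  # still 4 trips, now stored as weights
```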

Much of the following is data cleaning that you should be familiar with now.

library(tidyverse)
library(igraph)

tripdata <- "/Users/kaza/Dropbox/urban_analysis_tutorials/datasets/2.geospatialR/201806-citibike-tripdata.csv" %>% read_csv()

tripdata <- rename(tripdata,              #rename column names to get rid of the space
                   Slat = `start station latitude`,
                   Slon = `start station longitude`,
                   Elat = `end station latitude`,
                   Elon = `end station longitude`,
                   Sstid = `start station id`,
                   Estid = `end station id`,
                   Estname = `end station name`,
                   Sstname = `start station name`
                   
)
diffdesttrips <- tripdata[tripdata$Estid != tripdata$Sstid, ] # to make sure there are no loops or self-connections. 





(trips_graph <- diffdesttrips %>% 
        select(Sstid,Estid) %>%
        graph.data.frame(directed = T)) # We use a directed graph because each trip has a distinct origin and destination. You can choose to ignore the direction.
# IGRAPH 690872b DN-- 773 1909858 -- 
# + attr: name (v/c)
# + edges from 690872b (vertex names):
#  [1] 72->173  72->477  72->457  72->379  72->459  72->446  72->212 
#  [8] 72->458  72->212  72->514  72->514  72->465  72->173  72->524 
# [15] 72->173  72->212  72->382  72->462  72->3141 72->456  72->519 
# [22] 72->328  72->426  72->3659 72->359  72->525  72->405  72->458 
# [29] 72->457  72->3163 72->426  72->328  72->376  72->3459 72->304 
# [36] 72->457  72->486  72->527  72->490  72->459  72->459  72->3178
# [43] 72->3178 72->515  72->459  72->388  72->500  72->508  72->173 
# [50] 72->426  72->368  72->3258 72->532  72->478  72->514  72->533 
# + ... omitted several edges


vcount(trips_graph)
# [1] 773
ecount(trips_graph)
# [1] 1909858
is.directed(trips_graph)
# [1] TRUE

You can access the vertices and edges using V and E functions.


V(trips_graph) %>% head()
# + 6/773 vertices, named, from 690872b:
# [1] 72  79  82  83  119 120
E(trips_graph) %>% head()
# + 6/1909858 edges from 690872b (vertex names):
# [1] 72->173 72->477 72->457 72->379 72->459 72->446

tmp1 <- diffdesttrips %>%
  group_by(Sstid) %>%
  summarise(
    stname = first(Sstname),
    lon = first(Slon),
    lat = first(Slat))%>%
  rename(stid = Sstid)

tmp2 <- diffdesttrips %>%
  group_by(Estid) %>%
  summarise(
    stname = first(Estname),
    lon = first(Elon),
    lat = first(Elat)) %>%
  rename(stid = Estid)

station_locs <- rbind(tmp1, tmp2) %>% unique()

You can extract subgraphs for a subset of vertices.

set.seed(200) # For reproducibility because of randomisation below
station_sample <- sample(V(trips_graph), 20)
sub_trips <- induced_subgraph(trips_graph, station_sample)


# plot using ggraph
library(ggraph)
ggraph(sub_trips, layout = 'kk') + 
    geom_edge_fan(show.legend = FALSE) +
    geom_node_point()

Note that the graph layout does not respect the geographic locations. If you want to fix the positions at their lat/long coordinates, you can specify them using layout parameters.
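One way to do this — sketched on a small hypothetical graph, since the exact layout API varies across ggraph versions — is to pass a matrix of coordinates to igraph’s base plot:

```r
library(igraph)

# A hypothetical three-station graph with made-up coordinates
g <- make_graph(c("A", "B", "B", "C", "A", "C"), directed = TRUE)
locs <- data.frame(name = c("A", "B", "C"),
                   lon  = c(-74.00, -73.99, -73.98),
                   lat  = c(40.71, 40.72, 40.70))

# Order the coordinates to match the vertex order of the graph
idx <- match(V(g)$name, locs$name)
coords <- cbind(locs$lon[idx], locs$lat[idx])

plot(g, layout = coords, vertex.size = 20, edge.arrow.size = 0.5)
```

The same idea applies to the station graph: build the coordinate matrix from the station locations tibble, matched on station id.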


Exercise

  • Use different layouts (such as star and circle) to visualise the network.
  • Use different edge attributes to style the edges of the above graph
  • Style the vertices using for e.g. number of incoming trips per day, or their location in different boroughs.

Degrees and their distribution

In graph theory, the degree of a vertex refers to the number of edges that have the vertex as one of their ends. In a directed graph, the out-degree counts only edges that start at the vertex, and the in-degree only edges that end at it. Degree distributions are important because they tell us how connected the graph is and which vertices, if any, are more important than others.
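On a tiny hypothetical directed graph, the three modes look like this:

```r
library(igraph)

g <- make_graph(c("A", "B", "A", "C", "B", "C"), directed = TRUE)

degree(g, mode = "out")  # A: 2, B: 1, C: 0  (edges leaving each vertex)
degree(g, mode = "in")   # A: 0, B: 1, C: 2  (edges arriving)
degree(g, mode = "all")  # in-degree + out-degree
```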

degree(trips_graph, mode = 'out') %>%
  as.tibble %>%
  ggplot()+
  geom_density(aes(x=value))


degree.distribution(trips_graph) %>% head() # Check out what the degree.distribution produces and how to interpret the results.
# [1] 0.000000000 0.003880983 0.005174644 0.002587322 0.001293661 0.001293661

In particular, you want to look for the outliers in the distribution. For example, to find the stations with only end trips but no start trips, i.e. stations that are solely destinations:

V(trips_graph)$name[degree(trips_graph, mode="out") == 0 & degree(trips_graph, mode="in") > 0]
#  [1] "3245" "3267" "3275" "3192" "3276" "3184" "3214" "3651" "3198" "3199"
# [11] "3203"

We can also visualise the stations that are popular origins on a map. However, we first need to attach the values to the station locations tibble.



tmp <- degree(trips_graph, mode = 'out') %>%
  as.tibble() %>%
  rename(Outdegree = value)%>%
  mutate(stid = V(trips_graph)$name %>% as.numeric()) 


station_locs <- station_locs %>% 
  left_join(tmp, by='stid')

library(leaflet)

 Npal <- colorNumeric(
   palette = "Reds", n = 5,
   domain = station_locs$Outdegree
 )
 
station_locs %>%
    leaflet()  %>%
  addProviderTiles(providers$Stamen.TonerLite, group = "Basemap") %>%
   addCircles(
     lng = station_locs$lon,
     lat = station_locs$lat,
     radius = (station_locs$Outdegree - mean(station_locs$Outdegree))/sd(station_locs$Outdegree) * 30,
     fillOpacity = .6,
    fillColor = Npal(station_locs$Outdegree),
     group = 'Stations',
     stroke=FALSE
   ) %>%
  addLegend("topleft", pal = Npal, values = ~Outdegree,
            labFormat = function(type, cuts, p) {
              n = length(cuts) 
              paste0(prettyNum(cuts[-n], digits=0, big.mark = ",", scientific=F), " - ", prettyNum(cuts[-1], digits=0, big.mark=",", scientific=F))
            },
            title = "Out degree/Trip Starts",
            opacity = 1
  )

You can also find out which stations are predominantly origin stations and which are predominantly destinations. I use 20% more starts than ends as an arbitrary cut-off.

# Careful with division. If you have nodes that are only origins, i.e. whose in-degree is 0, you will have a division-by-zero problem.

V(trips_graph)$name[degree(trips_graph, mode="out") / degree(trips_graph, mode="in") > 1.2]
#  [1] "216"  "258"  "266"  "289"  "339"  "356"  "391"  "397"  "399"  "443" 
# [11] "469"  "3152" "3161" "3164" "3177" "3302" "3316" "3366" "3536" "3539"
# [21] "3542" "3620" "3623"

Exercise

  • Identify the popular origin stations of the trips by different times of the day and day of the week
  • Modify the above code slightly to identify the popular destinations by different times of the day and day of the week.

Other centrality measures

Eigen centrality

Let \(A = (a_{ij})\) be the adjacency matrix of a graph, where \(a_{ij} = 1\) when nodes \(i\) and \(j\) are connected and \(0\) otherwise. The eigenvector centrality \(x_{i}\) of node \(i\) is given by: \[x_i = \frac{1}{\lambda} \sum_{k} a_{ki} \, x_k\] where \(\lambda \neq 0\) is a constant (the largest eigenvalue of \(A\)).

Intuitively, a node is considered more central/important if it is connected to a large number of highly central/important nodes. So a node may have a high degree score (i.e. many connections) but a relatively low eigenvector centrality score if many of those connections are to similarly low-scoring nodes.
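The fixed-point relation above can be checked by power iteration — repeatedly multiplying by \(A\) and normalising — on a small toy graph; eigen_centrality returns the same vector, scaled so the maximum entry is 1:

```r
library(igraph)

g <- make_graph(c(1, 2, 2, 3, 3, 4, 4, 2), directed = FALSE)  # vertex 2 is the hub
A <- as_adjacency_matrix(g, sparse = FALSE)

x <- rep(1, vcount(g))
for (i in 1:200) {                # power iteration converges to the leading eigenvector
  x <- as.numeric(A %*% x)
  x <- x / sqrt(sum(x^2))
}

round(x / max(x), 4)                  # rescaled so the largest entry is 1
round(eigen_centrality(g)$vector, 4)  # igraph's answer matches
```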

station_locs$eigencentrality <- eigen_centrality(trips_graph)$vector

Page rank centrality

Page rank is a variant of eigenvector centrality, made popular by the founders of Google. The Boxplot function (from the car package) helpfully identifies the outliers, which you can then visualise.

library(car)
k <- page_rank(trips_graph)$vector %>% Boxplot()

station_locs[k, 'stname']
# # A tibble: 10 x 1
#    stname                
#    <chr>                 
#  1 Pershing Square North 
#  2 West St & Chambers St 
#  3 E 17 St & Broadway    
#  4 Broadway & E 22 St    
#  5 W 21 St & 6 Ave       
#  6 12 Ave & W 40 St      
#  7 Kent Ave & N 7 St     
#  8 Broadway & E 14 St    
#  9 W 33 St & 7 Ave       
# 10 Central Park S & 6 Ave

Closeness centrality

While the above measures of centrality depend on the local neighborhood, we can also take the whole graph into account. For example, if a station can reach every other station within a few hops, it is considered more central than one whose trips are more localised within its neighborhood.

k <- closeness(trips_graph, mode='all') %>% Boxplot()

station_locs[k, 'stname']
# # A tibble: 7 x 1
#   stname                
#   <chr>                 
# 1 Liberty Light Rail    
# 2 Newport Pkwy          
# 3 Essex Light Rail      
# 4 NYCBS DEPOT - DELANCEY
# 5 Morris Canal          
# 6 Columbus Drive        
# 7 Bike Mechanics at Riis

There are hundreds of centrality measures out there. Be careful to pick the right ones for your problem. To assist with this, you can use the CINNA (Central Informative Nodes in Network Analysis) package on CRAN for computing, analysing and comparing centrality measures.

library(CINNA)
proper_centralities(trips_graph)
#  [1] "Alpha Centrality"                                
#  [2] "Bonacich power centralities of positions"        
#  [3] "Page Rank"                                       
#  [4] "Average Distance"                                
#  [5] "Barycenter Centrality"                           
#  [6] "BottleNeck Centrality"                           
#  [7] "Centroid value"                                  
#  [8] "Closeness Centrality (Freeman)"                  
#  [9] "ClusterRank"                                     
# [10] "Decay Centrality"                                
# [11] "Degree Centrality"                               
# [12] "Diffusion Degree"                                
# [13] "DMNC - Density of Maximum Neighborhood Component"
# [14] "Eccentricity Centrality"                         
# [15] "Harary Centrality"                               
# [16] "eigenvector centralities"                        
# [17] "K-core Decomposition"                            
# [18] "Geodesic K-Path Centrality"                      
# [19] "Katz Centrality (Katz Status Index)"             
# [20] "Kleinberg's authority centrality scores"         
# [21] "Kleinberg's hub centrality scores"               
# [22] "clustering coefficient"                          
# [23] "Lin Centrality"                                  
# [24] "Lobby Index (Centrality)"                        
# [25] "Markov Centrality"                               
# [26] "Radiality Centrality"                            
# [27] "Shortest-Paths Betweenness Centrality"           
# [28] "Current-Flow Closeness Centrality"               
# [29] "Closeness centrality (Latora)"                   
# [30] "Communicability Betweenness Centrality"          
# [31] "Community Centrality"                            
# [32] "Cross-Clique Connectivity"                       
# [33] "Entropy Centrality"                              
# [34] "EPC - Edge Percolated Component"                 
# [35] "Laplacian Centrality"                            
# [36] "Leverage Centrality"                             
# [37] "MNC - Maximum Neighborhood Component"            
# [38] "Hubbell Index"                                   
# [39] "Semi Local Centrality"                           
# [40] "Closeness Vitality"                              
# [41] "Residual Closeness Centrality"                   
# [42] "Stress Centrality"                               
# [43] "Load Centrality"                                 
# [44] "Flow Betweenness Centrality"                     
# [45] "Information Centrality"                          
# [46] "Dangalchev Closeness Centrality"                 
# [47] "Group Centrality"                                
# [48] "Harmonic Centrality"                             
# [49] "Local Bridging Centrality"                       
# [50] "Wiener Index Centrality"

Community Detection

Finding a community structure in a network means identifying clusters of nodes that are more strongly connected within the cluster than to nodes outside it. There are many algorithms that identify community structures; some work on weighted networks and some on digraphs. In any case, it is better to convert a multigraph (a graph in which the same pair of nodes can have multiple edges) into a simple graph with edge weights representing the number of trips.

E(trips_graph)$weight <- 1
station_graph  <- simplify(trips_graph, edge.attr.comb="sum")

ecount(station_graph) == ecount(trips_graph)
# [1] FALSE
sum(E(station_graph)$weight) == ecount(trips_graph)
# [1] TRUE
all.equal(vcount(station_graph), vcount(trips_graph))
# [1] TRUE

is.directed(station_graph)
# [1] TRUE

The walktrap algorithm finds communities via random walks on the graph. The intuition is that random walks are more likely to stay within a community than to cross between communities.
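As a sanity check on a hypothetical graph with obvious structure — two cliques joined by a single bridge edge — walktrap recovers the two groups:

```r
library(igraph)

g <- make_full_graph(5) %du% make_full_graph(5)  # disjoint union: vertices 1-10
g <- add_edges(g, c(1, 6))                       # a single bridge edge

membership <- cluster_walktrap(g)$membership
membership  # vertices 1-5 fall in one community, 6-10 in the other
```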

wlktrp_mmbr <- data.frame (clstrmm  = cluster_walktrap(station_graph)$membership %>% as.factor, 
                          stid = V(station_graph)$name %>% as.numeric()
                          ) %>% as.tibble()


wlktrp_mmbr$clstrmm %>% summary()
#   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
#  80 218 207  61   3 195   1   1   1   1   1   1   1   1   1

It seems like there are five distinct and non-trivial clusters of stations. You can visualise these results by joining them back to the tibble with location information.


station_locs <- station_locs %>%
                inner_join(wlktrp_mmbr, by='stid')

library(ggmap)
nybg <- get_map(location=c(-74.0060,40.7128), maptype = 'toner-lite', zoom = 11)

library(RColorBrewer)
numColors <- levels(station_locs$clstrmm) %>% length()
myColors <- colorRampPalette(brewer.pal(8,"Dark2"))(numColors) # Expanding the number of colors available from 8 

  ggmap(nybg)+
  geom_point(aes(x=lon, y=lat, color=clstrmm), alpha=.9, data=station_locs)+
  scale_color_manual(values = myColors)+ 
     scale_x_continuous("", breaks=NULL)+
   scale_y_continuous("", breaks=NULL)+
   theme(panel.background = element_rect(fill='white',colour='white'), legend.position = "none") +
  labs(colour = "Communities") 


Exercise

  • Do the clusters change if you use a different algorithm?
  • Are the clusters different for customers and subscribers? Men and Women?

Identifying clusters of deprivation

In addition to obvious networks, we can also use graph theory to think through some non-obvious applications. One place this often crops up in planning is in thinking about adjacency/proximity as a graph. See my post on employment centers and paper on delineating areas.

In this section, I want to demonstrate how to use networks for other kinds of spatial analysis. We can think of polygons as nodes, with two nodes connected if their polygons share some part of a boundary. I also want to demonstrate Kyle Walker’s excellent tigris and tidycensus packages that allow us to conveniently download Census data.

First, you need to acquire an API key from the Census Bureau.

We will use the API key to access American Community Survey (ACS) data. The ACS is an ongoing survey by the U.S. Census Bureau. It regularly gathers information previously contained only in the long form of the decennial census, such as ancestry, educational attainment, income, language proficiency, migration, disability, employment, and housing characteristics. About 3.5 million people are surveyed every year and multi-year summaries are provided at different levels of geography (though not at the block level).

Because it is a survey, each variable comes with a margin of error. These are widely ignored in practice and there is a tendency to take the estimates at face value. This practice is problematic.
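tidycensus returns each estimate alongside a moe column (a 90% margin of error). One common screen is the coefficient of variation; here is a sketch on made-up numbers (the tract ids and values are hypothetical):

```r
library(dplyr)

acs <- tibble::tibble(
  GEOID    = c("tract_A", "tract_B"),  # hypothetical tracts
  estimate = c(250, 40),
  moe      = c(60, 38)
)

screened <- acs %>%
  mutate(se = moe / 1.645,         # convert the 90% MOE back to a standard error
         cv = se / estimate,       # coefficient of variation
         reliable = cv < 0.30)     # a common (but arbitrary) rule of thumb
screened
# tract_A passes the screen; tract_B's estimate is too noisy to trust
```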

library(tidycensus)
library(sf)
library(tidyverse)
library(mapview)
census_api_key(census_apikey)
varnames <- load_variables(2016, "acs5", cache = TRUE)
write_csv(varnames, path="/Users/kaza/Desktop/acs_varnames.csv")

It is useful to write the variables and their descriptions into an external file that you can use in other programs; this is usually handy for finding variables of interest. For example, to find the poverty rate in the Carolinas’ census tracts, you can use B17001_001 (persons for whom poverty status is determined) and B17001_002 (persons whose income in the past 12 months was below the poverty level).

NC_SC <- c('37', '45') # FIPS code for NC and SC.
pov <- reduce(
  map(NC_SC, function(x) {
    get_acs(geography = "tract", variables = "B17001_002",  summary_var = 'B17001_001',
            state = x, geometry = TRUE, year = 2016)
  }), 
  rbind
) ## read up on these map and reduce functions in purrr. They are key to functional programming paradigm
nrow(pov)
# [1] 3298
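If map and reduce are new to you, the pattern is: map applies a function to each element of a vector or list and returns a list of results; reduce then folds that list into a single object with a binary function. A minimal illustration:

```r
library(purrr)

# map(): one result per input element, returned as a list
squares <- map(1:4, function(x) x^2)

# reduce(): fold the list together pairwise
total <- reduce(squares, `+`)
total
# [1] 30

# The ACS code above uses the same pattern: one data frame per state,
# folded together with rbind
dfs <- map(c(3, 5), function(n) data.frame(state = n, value = seq_len(n)))
nrow(reduce(dfs, rbind))
# [1] 8
```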

pov_rate <- pov %>%
  rename(population = summary_est) %>%
  filter(population>0)%>%
  mutate(pov_rate = estimate/population) %>%
  select(GEOID, NAME, population, pov_rate)

library(mapview)
mapview(pov_rate, 
        zcol=c('pov_rate'),
        col.regions = brewer.pal(5, "Reds"),
        at=quantile(pov_rate$pov_rate, seq(0,1,.2)),
        lwd=0
        )

Unemployment rate

Unemployment rate is trickier, because it needs to track both the labour force as well as persons unemployed. It turns out that ACS only provides this data for males and females separtely and by differnt age categories. So we need to assemble them and combine them.

# variables for labour force for age groups 16-64
lf_m <- paste("B23001_", formatC(seq(4,67,7), width=3, flag="0"), "E", sep="")
lf_f <- paste("B23001_", formatC(seq(90,153,7), width=3, flag="0"), "E", sep="")

lf <- reduce(
  map(NC_SC, function(x){
    get_acs(geography='tract', variables = c(lf_m, lf_f), state=x, year=2016)
  }),
  rbind
)
# variables for unemployed for age groups 16-64
unemp_m <- paste("B23001_", formatC(seq(8,71,7), width=3, flag="0"), "E", sep="")
unemp_f <- paste("B23001_", formatC(seq(94,157,7), width=3, flag="0"), "E", sep="")

unemp <- reduce(
  map(NC_SC, function(x){
    get_acs(geography='tract', variables = c(unemp_m, unemp_f), state=x, year=2016)
  }),
  rbind
)
lf_t <- lf %>% 
  group_by(GEOID) %>%
  summarize(lf_est = sum(estimate, na.rm=T))

unemp_t <- unemp %>% 
  group_by(GEOID) %>%
  summarize(unemp_est = sum(estimate, na.rm=T))

unemp_rate <- merge(lf_t, unemp_t, by='GEOID') %>% 
  filter(lf_est >0) %>%
  mutate(unemp_rate = unemp_est/lf_est)

Exercise

  • Compute unemployment rate for more meaningful classes (Women, Youth etc.) and visualise it. Are there differences in spatial distribution?

We can arbitrarily define distress as a combination of poverty and unemployment. In this instance, I am picking 20% and 10% as the respective thresholds.

tract_stats <- merge(pov_rate, unemp_rate, by='GEOID')
distressed_tracts <- tract_stats %>%
                     filter(unemp_rate > .10 & pov_rate > .20)

mapview(distressed_tracts)

There are reasons to presume that clusters of distress are more problematic than isolated distress. We can use the contiguity of distressed tracts to figure out where these clusters are and how they are shaped.

It turns out that contiguity calculations in sf are still somewhat slower than in older packages such as spdep. So I will convert the simple features to a Spatial object and use the older packages.

library(rgeos)
library(spdep)

dt <- as(distressed_tracts, "Spatial") # Convert to spatial
temp1 <- gUnarySTRtreeQuery(dt) # Construct list of polygons that are likely to intersect/touch by first looking at bounding boxes.
dt_nb <- poly2nb(dt, queen=FALSE, foundInBox=temp1, row.names = dt$GEOID) # Construct a neighborhood object

plot(dt)

plot(dt_nb, coordinates(dt), col='red', points=F)


distressed_tracts_graph <- dt_nb %>%  
  nb2mat(zero.policy=TRUE, style="B") %>%
  graph.adjacency(mode='undirected', add.rownames=NULL) # construct an undirected graph from the adjacency matrix.

cl <- clusters(distressed_tracts_graph)

ggplot() + 
  geom_histogram(aes(cl$csize))


sum(dt$GEOID != names(cl$membership)) # This is 0, so we can safely assign the cluster membership by row order rather than doing a merge.
# [1] 0
dt$clustermembership <- cl$membership

The clusters function decomposes the graph into subgraphs (connected components): nodes belong to the same subgraph if there is a path between them; if not, they belong to different subgraphs.
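For instance, on a small hypothetical graph with two disconnected pieces:

```r
library(igraph)

g <- make_graph(c(1, 2, 2, 3, 4, 5), directed = FALSE)  # a path 1-2-3 and an edge 4-5

cl <- clusters(g)
cl$no          # 2 components
cl$csize       # of sizes 3 and 2
cl$membership  # which component each vertex belongs to
```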

pal <- colorRampPalette(brewer.pal(8, "Dark2"))
numColors <- cl$membership %>% unique() %>% length()

dt %>% st_as_sf() %>%
  mapview(zcol="clustermembership", 
          col.regions=pal(numColors),
          legend = FALSE,
          lwd = 0
          )

Within each component of the graph, you can identify a community structure if you wish, using the techniques from above.

In some instances, it may be useful to see if the clusters are ‘stringy’ or if they are ‘blobs’, i.e. if the clusters follow linear features (such as roads, valleys, rivers etc.) or if adjacency is shared among many polygons of the same component. To do this, we can rely on the diameter of the graph, which is the longest shortest path between any two nodes. If the graph is stringy, it will have a large diameter.
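The intuition is easy to check on hypothetical extremes: a path graph (maximally stringy) versus a complete graph (maximally blobby) on the same number of nodes:

```r
library(igraph)

stringy <- make_ring(10, circular = FALSE)  # a path of 10 nodes
diameter(stringy)
# [1] 9

blobby <- make_full_graph(10)               # 10 mutually adjacent nodes
diameter(blobby)
# [1] 1
```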

In this instance, we want to compute the diameter of each component separately (the diameter of the full graph is \(\infty\), as it is not connected).


diameters_of_graphs <-   distressed_tracts_graph %>%
       decompose() %>%
       sapply(function(x)diameter(x)) %>% 
       as.tibble()

names(diameters_of_graphs)[1] <- 'dia'

diameters_of_graphs$cl_membership <- 1:(distressed_tracts_graph %>% decompose() %>% length())

dt <- merge(dt, diameters_of_graphs, by.x='clustermembership', by.y='cl_membership')

plot(dt[dt$dia > 10,'GEOID'], col='red') # Quickly showing the stringy and blobby (real words) distress
plot(dt[dt$dia < 3,'GEOID'], col = 'blue', add = T)

You will notice that most of the large spatial concentrations of distress are primarily in rural areas (the midlands of South Carolina and the Eastern Carolina plains). However, this may be misleading. There are also concentrations of distress within urban regions, but since urban census tracts are geographically very small, they are more likely to appear as small, separated clusters. The unequal sizes of the geographic units thus affect the conclusions you can draw.


Exercise

  • Limiting your analysis to Metropolitan Areas in Carolinas, what conclusions can you draw about the shape of the clusters?

  • Does limiting your analysis to a single state affect your results?


Instead of categorising areas of distress, one could use local Moran’s I to identify locations where high values of a continuous variable (say the poverty rate) are surrounded by other high-value areas (or low-value areas) and test their statistical significance. localmoran is a function in spdep that relies on the neighbourhood graph. I leave this as an exercise.

Conclusions

Network analysis techniques provide a very useful set of tools that we can apply in many urban analytics contexts. Networks are ubiquitous, and the above are but a few of the situations where network tools can be readily applied. They need to be used more, and used judiciously.
